Module 4.1: Multiple Regressors
Old Dominion University
In the last module, we developed quite a few models to explain housing prices. However, there are many factors that will impact the price of a house. Again, an important feature of economic models is that they’re relatively simple. However, our previous models were too simple. To make our model a bit more realistic, we can include other factors that might influence price. First, let’s read in the data, remove some columns, and change the names of the remaining columns.
In addition, we are going to create a new variable for age.
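The age calculation can be sketched on a tiny made-up stand-in for the real data (the column names and values here are invented; the real data set has many more columns and rows):

```r
# Illustrative only: a two-row stand-in for the housing data.
# Age at sale is the year sold minus the year built.
toy <- data.frame(price    = c(215000, 105000),
                  sqft     = c(1656, 896),
                  yr_sold  = c(2010, 2010),
                  yr_built = c(1960, 1961))
toy$age <- toy$yr_sold - toy$yr_built  # age of the home at sale
toy$age  # 50 49
```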
We’ll start by estimating a simple model where price is determined by age. Next, we’ll estimate a model where price is determined by age and square footage. Before we estimate anything, though, let’s first establish some of the correlations between the variables.
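A correlation matrix is one call to `cor()`. The sketch below uses simulated stand-ins for price, square footage, and age (all numbers made up); with the real data you would call `cor()` on the relevant columns of the `ames` data frame directly:

```r
# Simulated stand-ins that mimic the qualitative pattern in the housing data:
# older homes tend to be smaller, and smaller/older homes sell for less.
set.seed(41)
n     <- 1000
age   <- runif(n, 0, 100)
sqft  <- 2500 - 10 * age + rnorm(n, 0, 300)
price <- 40000 + 100 * sqft - 400 * age + rnorm(n, 0, 20000)
round(cor(cbind(price, sqft, age)), 2)
```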
Why is this so important? Let’s first estimate a simple model:
\[\text{Price}_i = \alpha_0 + \alpha_1 \times \text{Age}_i + \epsilon_i\]
\(\alpha_1 = -1474.96\) suggests that for every one-year increase in property age, housing price decreases by $1,474.96. However, when we are talking about an older house, we are generally also talking about a smaller home (as evidenced by the correlation). This means that the regression coefficient of \(-1474.96\) is actually a composite of the effect of increased age and decreased square footage! In other words, when age increases in these data, square footage tends to decrease along with it. So which effect are we really picking up?
Obviously, in reality, houses do not shrink as they age, but the model doesn’t know that! It only understands the world through the data and correlations it observes. In these data, older houses are also smaller, so it’s important to control for the size of a home to get a true measure of the effect of age on prices.
Another way to think about this is that you are trying to compare houses that are the same in every way except age. This would be like a Randomized Control Trial (RCT). “Controlling” for square footage gets us closer to mimicking an RCT by removing the effect of changes in square footage.
How do we control for square footage?
To get an estimate of the isolated effect of age on price, we need some extra information. First, we need the initial regression where price is a function of age. Next, we need to know how square footage will respond to changes in age. To get this information, we’ll estimate a regression where square footage is the outcome and age is the explanatory variable. Lastly, we need to know how price responds to changes in square footage.
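The three steps above can be sketched on simulated data (all numbers are made up; only the mechanics matter here). A useful fact from OLS algebra: the slope from the short regression equals the age effect from the full model plus the square-footage effect times how square footage moves with age.

```r
# Simulated illustration of the three regressions described above.
set.seed(1)
n     <- 500
age   <- runif(n, 0, 100)
sqft  <- 2500 - 10 * age + rnorm(n, 0, 200)   # sqft falls with age
price <- 50000 + 100 * sqft - 500 * age + rnorm(n, 0, 10000)
short <- lm(price ~ age)           # step 1: price on age alone
aux   <- lm(sqft ~ age)            # step 2: how sqft responds to age
long  <- lm(price ~ age + sqft)    # step 3 rolled into the full model
# Short slope = full-model age effect + sqft effect * (sqft response to age):
coef(short)["age"]
coef(long)["age"] + coef(long)["sqft"] * coef(aux)["age"]
```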
The results:
This is a rough approximation because all of those regressions are inherently biased for the same reason that the first one is. So, rather than estimating three regressions and doing some major algebra, we can do this all in one step. We are going to estimate the following model:
\[\text{Price}_i = \beta_0 + \beta_1 \times \text{Age}_i + \beta_2 \times \text{Sq. Ft.}_i + \epsilon_i\]
Without going through all the math, OLS will simultaneously find two lines of best fit. In fact, the fit becomes a surface, or a plane, since there are now two explanatory variables instead of just one. \(\beta_0\), \(\beta_1\), and \(\beta_2\) will still minimize the sum of squared residuals and still produce an average residual of zero.
Now let’s estimate the model in R. To include another variable in lm(), simply add it to the right side of the formula like you would a math equation.
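As a self-contained sketch of the syntax, here is the same kind of call on a made-up stand-in for the `ames` data frame (coefficients will differ from the real output below):

```r
# A fabricated data frame, just to show the formula syntax.
set.seed(1)
fake <- data.frame(age = runif(50, 0, 100), sqft = runif(50, 800, 3000))
fake$price <- 80000 - 1000 * fake$age + 100 * fake$sqft + rnorm(50, 0, 20000)
# A second regressor is added with `+` on the right side of the formula:
fit <- lm(price ~ age + sqft, data = fake)
coef(fit)
```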
(Intercept) age sqft
79973.60078 -1087.23673 95.96949
This model also makes intuitive sense: it says that older houses are worth less and larger houses are worth more.
Let’s practice with another variable: bedrooms. Let’s estimate a model of the form:
\[\text{Price}_i = \gamma_0 + \gamma_1 \times \text{Sq. Ft.}_i + \gamma_2 \times \text{Bedrooms}_i + \epsilon_i\]
Let’s estimate three models:
r1 <- lm(price ~ sqft, ames)
r2 <- lm(price ~ bedrooms, ames)
r3 <- lm(price ~ sqft + bedrooms, ames)
regz <- list(`Price` = r1,
`Price` = r2,
`Price` = r3)
coefz <- c("sqft" = "Square Footage",
"bedrooms" = "Bedrooms",
"(Intercept)" = "Constant")
gofz <- c("nobs", "r.squared")
library("modelsummary")
modelsummary(regz,
title = "Effect of Sq. Ft. and Bedrooms on Sale Price",
estimate = "{estimate}{stars}",
coef_map = coefz,
gof_map = gofz)

| | Price | Price | Price |
|---|---|---|---|
| Square Footage | 111.694*** | | 136.361*** |
| | (2.066) | | (2.247) |
| Bedrooms | | 13889.495*** | −29149.110*** |
| | | (1765.042) | (1372.135) |
| Constant | 13289.634*** | 141151.743*** | 59496.236*** |
| | (3269.703) | (5245.395) | (3741.249) |
| Num.Obs. | 2930 | 2930 | 2930 |
| R2 | 0.500 | 0.021 | 0.566 |
In the third model, we find:
Wait a minute… More bedrooms decrease sale price? Initially, this might be confusing, so let’s think about this carefully. In words, this coefficient’s interpretation is:
Increasing the number of bedrooms by one, holding square footage constant, decreases sale price by $29,149.11.
What does it mean to increase the number of bedrooms in a property while holding square footage constant? Suppose a home is 2,000 square feet with four rooms. Each room is 500 square feet (on average). If we add an additional room, each room would only be 400 square feet (on average). Therefore, adding an extra bedroom, while holding square footage constant, makes for a bunch of small rooms. This is not something that is typically sought after on the housing market, hence the negative coefficient.
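The arithmetic behind this example, spelled out:

```r
sqft <- 2000
sqft / 4   # four bedrooms: 500 sq. ft. per room on average
sqft / 5   # five bedrooms: only 400 sq. ft. per room
```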
Below is a progression of three models:
r1 <- lm(log(price) ~ log(sqft), ames)
r2 <- lm(log(price) ~ log(sqft) + bedrooms, ames)
r3 <- lm(log(price) ~ log(sqft) + bedrooms + age, ames)
regz <- list(`log(Price)` = r1,
`log(Price)` = r2,
`log(Price)` = r3)
coefz <- c("log(sqft)" = "log(Square Footage)",
"bedrooms" = "Bedrooms",
"age" = "Age")
gofz <- c("nobs", "r.squared")
library("modelsummary")
modelsummary(regz,
title = "Determinants of Sale Price",
estimate = "{estimate}{stars}",
coef_map = coefz,
gof_map = gofz)

| | log(Price) | log(Price) | log(Price) |
|---|---|---|---|
| log(Square Footage) | 0.908*** | 1.094*** | 0.875*** |
| | (0.016) | (0.018) | (0.015) |
| Bedrooms | | −0.138*** | −0.082*** |
| | | (0.007) | (0.006) |
| Age | | | −0.006*** |
| | | | (0.000) |
| Num.Obs. | 2930 | 2930 | 2930 |
| R2 | 0.523 | 0.580 | 0.731 |
Take a look at how the \(R^2\) changes from model to model. Each new control variable adds a little bit more information, which improves the model’s ability to explain prices. As a way to visualize this, we can plot the fitted values against the true, observed values. If the model were perfect, all the points would fall on the red 45° (\(y = x\)) line.
par(mfrow = c(1, 3))
ylim <- exp(range(r1$fitted.values, r2$fitted.values, r3$fitted.values))
plot(ames$price, exp(r1$fitted.values), ylim = ylim,
ylab = "Fitted/Predicted Price", xlab = "",
main = "Controls: sqft",
col = scales::alpha("black", 0.2), pch = 19)
abline(0, 1, col = "tomato")
legend("bottomright", bty = "n", cex = 2,
legend = paste0("R2: ", round(summary(r1)$r.squared, 3)))
plot(ames$price, exp(r2$fitted.values), ylim = ylim,
xlab = "Observed Price", ylab = "",
main = "Controls: sqft + bedrooms",
col = scales::alpha("black", 0.2), pch = 19)
abline(0, 1, col = "tomato")
legend("bottomright", bty = "n", cex = 2,
legend = paste0("R2: ", round(summary(r2)$r.squared, 3)))
plot(ames$price, exp(r3$fitted.values), ylim = ylim,
ylab = "", xlab = "",
main = "Controls: sqft + bedrooms + age",
col = scales::alpha("black", 0.2), pch = 19)
abline(0, 1, col = "tomato")
legend("bottomright", bty = "n", cex = 2,
legend = paste0("R2: ", round(summary(r3)$r.squared, 3)))

As control variables are included, the \(R^2\) increases and the points start to get tighter to the diagonal line. This means that the predicted values are getting closer to the actual values.
Points that are above the 45° line are expected to have higher sale prices than what they actually sold for. Points below the line sold for higher prices than what the model predicted.
Imagine you are a data scientist for Zillow, and you can observe characteristics about homes for sale and the list price. You could estimate a model to explain list price with square footage, bedrooms, age, size of the backyard, whether the property has a pool, etc. Then, for each property, you could generate a fitted sale price. If the fitted price is above the list price, this represents an underpriced home for sale (and an investment opportunity). On the other hand, a fitted price below a list price represents an overpriced home.
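A toy version of this screening idea, on simulated data (all names and numbers are invented for illustration): fit the model, then flag listings whose list price sits below the fitted value.

```r
# Flag listings priced below the model's fitted value.
set.seed(42)
n          <- 200
sqft       <- runif(n, 800, 3000)
list_price <- 20000 + 120 * sqft + rnorm(n, 0, 15000)  # sellers price with noise
fit        <- lm(list_price ~ sqft)
gap        <- fitted(fit) - list_price   # positive gap = looks under-priced
head(order(gap, decreasing = TRUE))      # indices of the biggest "bargains"
```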
In 2018, Zillow started “Zillow Offers”, a division of their company dedicated to buying and selling homes based on pretty much the exact logic outlined above. However, Zillow failed in a major way.
Why? What happened? There were ultimately a lot of reasons for why this did not work, but a major reason was omitted variable bias. As an extreme, facetious example, consider 100 properties that are all identical. Of course, Zillow’s model will therefore predict the exact same sale price for each of these homes. However, the list prices of these homes will certainly not be the same. First, people are not robots and have different circumstances. This is going to introduce random noise into the list prices. Also, homeowners know a lot about their own homes compared to what Zillow can observe in data. This is going to introduce non-random noise into the list price.
Zillow was betting on the random part outweighing the non-random part. For example, suppose Zillow identifies the five most under-priced homes out of the lot of one hundred. These homes could be under-priced because the homeowners need to move ASAP for one reason or another and are willing to sell below market value. Or, these homes could be under-priced because the home is infested with termites and smells like a sewer. Not that Zillow was buying a bunch of termite traps, but their models were unable to control for everything that impacts the price of a home, and they were selecting homes that were under-priced for non-random reasons.
Of course, there are markets where this sort of logic can work as intended. During the pandemic (in Spring 2021), NFTs gained a lot of popularity. Specifically, NBA TopShot was incredibly popular.
NBA TopShot was a marketplace for the internet equivalent of basketball cards. The NFTs being traded (bought and sold) were highlights of basketball players dunking, shooting, defending, etc. Each highlight had a limited number released. For example, this highlight of Keldon Johnson only had 12,000 official copies. Prices of these NFTs depended on the player, the highlight, the rarity, and the serial number (#1 being the most expensive and #12,000 being the least expensive).
A clever econometrician could estimate a model for sale price and compare the fitted/predicted values to the listed prices of each NFT. NFTs with a list price under the predicted sale price could then be bought at a discount and immediately re-listed for the predicted sale price. Arbitrage.
Let’s use example data from a single NFT from 3/15/2021. By limiting it to only one NFT highlight, the prices are now completely determined by serial number. Below is a plot of serial number and list price.
This plot shows that most NFTs are relatively cheap with some very expensive ones. To get a clearer picture of the relationship, we can plot the data in a log-log format.
It would be unreasonable to purchase an NFT with serial number \(s\) for price \(p\) if you could purchase an NFT with a lower serial number \(s-\epsilon\) for the same price \(p\) (or lower). Remember, the lower the serial number, the higher the value! So, let’s only consider NFTs where we cannot get a lower serial number without paying more. These will be colored in red.
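One way to sketch this selection rule on simulated listings (all numbers invented): sort by serial number, then keep a listing only if every lower serial number costs strictly more.

```r
# "Frontier" rule: keep a listing only if no lower serial number is
# available at the same or a lower price.
set.seed(7)
serial <- sort(sample(1:12000, 300))                  # ascending serial numbers
price  <- exp(9.1 - 0.55 * log(serial) + rnorm(300, 0, 0.4))
prev_min <- cummin(c(Inf, head(price, -1)))  # cheapest lower-serial price so far
frontier <- price < prev_min                 # TRUE for the "reasonable" listings
```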
Once we have these “reasonable” NFTs picked out from the rest, we can fit a log-log regression through the points. The estimated parameters of this regression are:
(Intercept) logSerialNumber
9.0936344 -0.5453862
Now we are in a very similar situation as Zillow except with one key difference: all of these NFTs are the same! If we find an undervalued NFT, it is not because it is infested with termites or because it smells like a sewer. It is purely due to randomness in pricing by individuals.
We can visualize this relationship below. Points that are below the line are under-priced. Points above the line are over-priced. The distance from the line tells us the severity of the under- or over-pricing. We want to identify the point that is the farthest below the line so we can buy it and re-sell it.
The NFT that is farthest from the line is the circled one. This NFT’s serial number is 138, and it was on sale for $494. The model suggests re-selling it for $599.
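As a check, we can plug serial #138 into the fitted log-log equation using the rounded coefficients printed above (small differences from the $599 figure come from rounding the coefficients):

```r
b0 <- 9.0936344     # rounded intercept from the output above
b1 <- -0.5453862    # rounded slope on log(serial number)
exp(b0 + b1 * log(138))   # predicted resale price, roughly $600
```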
Of course, this sort of model is hypothetical and old news (the TopShot market has since crashed). However, the point I am trying to make is that Zillow omitted important variables from their models, and the results were disastrous.
Hopefully by now, you’re convinced that omitted variable bias is a problem for prediction. Also, after seeing what happened in the age and square footage example, hopefully you can see that omitted variables can also bias our coefficient interpretation.
Unfortunately, sometimes we cannot observe every variable that we’d like to control for in our regression. However, we can sign the bias of our coefficients if we can anticipate correlation coefficients.
Consider the age and square footage example. Initially, our omitted variable was square footage. We knew that square footage was positively correlated with price but negatively correlated with age. So, when we put square footage into the model, the coefficient for age went from \(-1474.96\) to \(-1087.24\). So, the omitted variable bias was negative since it was pushing the coefficient down in the negative direction.
The same thing happened with square footage and bedrooms. Bedrooms and price are negatively correlated, but bedrooms and square footage are positively correlated. So, our coefficient moved from \(111.69\) to \(136.36\), which is the same direction as the previous coefficients. Therefore, omitting bedrooms also caused negative omitted variable bias in this regression.
If you have a model \(Y = \delta_0 + \delta_1\times A + \epsilon\) and \(B\) is your omitted variable, the bias in \(\delta_1\) will be:
| | cor(A, B) > 0 | cor(A, B) < 0 |
|---|---|---|
| B’s effect on Y is positive | \(+\) | \(-\) |
| B’s effect on Y is negative | \(-\) | \(+\) |
Remember: positive (negative) bias means the coefficient is artificially pushed upwards (downwards).
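We can check one cell of the sign table by simulation (a sketch with made-up parameters): here \(B\) raises \(Y\) and is positively correlated with \(A\), so omitting \(B\) should push \(A\)'s coefficient upward.

```r
# Top-left cell of the table: cor(A, B) > 0 and B's effect on Y is positive,
# so the omitted-variable bias in A's coefficient is positive.
set.seed(311)
n <- 10000
A <- rnorm(n)
B <- 0.7 * A + rnorm(n)   # positively correlated with A
Y <- A + B + rnorm(n)     # true effect of A is 1; B's effect is positive
coef(lm(Y ~ A))["A"]      # biased upward, near 1.7
coef(lm(Y ~ A + B))["A"]  # near the true value of 1
```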
Now we know the effect of omitting important variables on both our predictions and coefficient interpretations. But what is the effect of the opposite: including unimportant variables?
Including unimportant variables will not bias your other coefficients because if they’re truly unimportant, we would expect \(\beta_i = 0\). Take a look at the above table – if \(B\) does not affect \(Y\) whatsoever, then the bias will be neither positive nor negative, regardless of \(B\)’s correlation with \(A\).
If this is the case, why don’t we throw every single variable we can think of into the model?! Well, there’s no such thing as a free lunch. Each time we include a new variable, especially when the new variable is highly correlated with one or more other variables, we lose a little bit of our precision. In other words, including new variables generally increases the standard errors of the coefficients for the other variables in the model. This makes it more difficult to effectively test our hypotheses.
Intuitively, adding two variables into a model that are highly correlated supplies the model with overlapping and redundant information. This makes it difficult for the model to parse out the effect of each variable individually. The problem of redundant information in a model is called multicollinearity. If two variables contain exactly the same information, the regression cannot differentiate between the two variables and therefore cannot be estimated without removing one of the variables. This is called perfect multicollinearity.
Multicollinearity occurs for one of two reasons:
Call:
lm(formula = price ~ age + yr_built, data = ames)
Coefficients:
(Intercept) age yr_built
239269 -1475 NA
The coefficient on yr_built is NA. This is because R could not estimate it (age and yr_built contain the same information), so it drops the variable from the model.
To demonstrate the effects of multicollinearity on model output, we are going to simulate some data. Basically, we are going to simulate the following data generating process:
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon\]
where \(\beta_0 = 0\) and \(\beta_1=\beta_2=1\).
The first time we run the model, \(X_1\) and \(X_2\) will be completely uncorrelated. However, we are going to use a for loop to re-run the model over and over, increasing the correlation between the two variables after each iteration. The code is not necessarily important, but you can still hopefully follow along.
set.seed(757)
N <- 100 # sample size
x1 <- rnorm(N, 0, 1) # create X1
e <- rnorm(N, 0, 3) # create unobserved error
all1 <- list()
for(a in seq(0, .99, 0.01)){
  # X2 is a combination of randomness and X1
  # When a = 0, X2 and X1 are completely uncorrelated
  x2 <- ((1 - a) * rnorm(N, 0, 1)) + (a * x1)
  rho <- cor(x1, x2) # calculate correlation
  y <- x1 + x2 + e # Data Generating Process
  # Estimate regression and store coefficients
  tmp <- as.data.frame(coef(summary(lm(y ~ x1 + x2))))
  # Store correlation
  tmp$a <- rho
  all1[[length(all1) + 1]] <- tmp # save results
}
# Combine results
all <- do.call(rbind, all1)
# Round off correlation
all$a <- round(all$a, 2)
layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE))
par(mar = c(4.1, 4.1, 1.1, 1.1))
plot(all$a[c(T,F,F)], las = 1, pch = 19,
col = scales::alpha("black", 0.6), cex = 1.25,
xlab = "Loop Iteration Number",
ylab = "Corr. Between X1 and X2")
abline(h = c(0, 0.5, 0.9), lty = 2)
plot(all$a[c(F,T,F)], all$Estimate[c(F,T,F)], las = 1,
xlab = "Correlation", cex = 1.25,
ylab = "Coefficient Estimate", pch = 19,
ylim = range(all$Estimate[c(F,T,F)][-which(all$Estimate[c(F,T,F)] == max(all$Estimate[c(F,T,F)]))]),
col = scales::alpha("black", 0.6))
abline(h = 1)
plot(all$a[c(F,T,F)], all$`t value`[c(F,T,F)], las = 1,
xlab = "Correlation", cex = 1.25,
ylab = "t-Statistic", pch = 19,
col = scales::alpha("black", 0.6))
abline(h = 1.96, lty = 2)
layout(matrix(c(1), 1, 1, byrow = TRUE))

Examining these three figures:
All three of these plots provide evidence that multicollinearity does not bias coefficients (until it becomes extreme), but does hinder the precision of the estimates.
To summarize the effect of including (or not including) additional regressors:
The way you should think about adding new variables to your model is through a bias-variance tradeoff. We know that coefficients can be biased if we omit important variables, but our standard errors get larger as the number of parameters we have to estimate increases.
It’s important to include only the most important variables in your model to balance bias and variance. You might ask: how can I tell which variables are most important? To that I would say: good question, and the best answer is economic theory.
Economic theory should guide your choices about which variables to include/exclude and which functional forms you should use (log(), x + x^2, etc.). Unfortunately, this might seem like an unsatisfying answer, but econometrics is as much an art as it is a science. As you become more comfortable with econometrics, selecting variables and functional forms will become more natural.
In the next portion of the module, we will discuss how to incorporate binary/categorical variables into our models.
ECON 311: Economics, Causality, and Analytics